Bloom Filters and Locality-Sensitive Hashing

Author

  • Kurt Mehlhorn
Abstract

This section follows 5.5.3 in Mitzenmacher/Upfal. Bloom Filters are a compact data structure for approximate membership queries. The data structure uses only linear space. It makes mistakes in the sense that non-members may be declared members with small probability. Let $S = \{e_1, \ldots, e_m\}$ be the elements to be stored; $S$ is a subset of some universe $U$. We use a boolean array $T$ of length $n$. We assume that we have $k$ hash functions that map $U$ to the integers from 1 to $n$. The hash functions are assumed to be independent and random. We initialize $T$ to the all-zero array and then set $T[i]$ to one iff there is an element $e_j \in S$ and a hash function $h_\ell$ such that $h_\ell(e_j) = i$. Note that an array entry may be set to one by several elements. In order to answer a membership query for $e$, we compute the hash values $h_\ell(e)$, $1 \le \ell \le k$, and return "YES" if and only if $T[h_\ell(e)] = 1$ for all $\ell$. Clearly, if $e \in S$, the answer will be YES. Also, if the answer is NO, then $e \notin S$. However, there is the possibility of false positives, i.e., the data structure might answer YES for elements $e \notin S$. What is the probability that $e$ is a false positive? Let us first compute the probability that a particular $T[i]$ stays zero. This is $\left(1 - \frac{1}{n}\right)^{km} = \left(1 - \frac{1}{n}\right)^{n \cdot km/n} \approx e^{-km/n}$.
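The construction described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the class name `BloomFilter`, the helper `zero_fraction`, and the use of salted SHA-256 digests in place of truly independent random hash functions are all choices made here for the example, and array indices run from 0 to n-1 rather than 1 to n.

```python
import hashlib
import math


class BloomFilter:
    """Boolean array T of length n, probed by k hash functions."""

    def __init__(self, n: int, k: int) -> None:
        self.n = n
        self.k = k
        self.T = [False] * n  # initialized to the all-zero array

    def _positions(self, element: str):
        # Assumption: k salted SHA-256 digests stand in for the k
        # independent random hash functions h_1, ..., h_k.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{element}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, element: str) -> None:
        # Set T[h(e)] to one for every hash function h.
        for i in self._positions(element):
            self.T[i] = True

    def __contains__(self, element: str) -> bool:
        # Answer YES iff all k probed entries are one.
        return all(self.T[i] for i in self._positions(element))


def zero_fraction(bf: BloomFilter) -> float:
    """Fraction of array entries that are still zero."""
    return bf.T.count(False) / bf.n


if __name__ == "__main__":
    n, k, m = 10_000, 5, 1_000
    bf = BloomFilter(n, k)
    for j in range(m):
        bf.add(f"e{j}")
    print("every stored element answers YES:",
          all(f"e{j}" in bf for j in range(m)))
    print("observed zero fraction:", zero_fraction(bf))
    print("estimate e^(-km/n):    ", math.exp(-k * m / n))
```

Running the script compares the observed fraction of zero entries with the estimate $e^{-km/n}$ from the abstract; with hash functions that behave close to random, the two values should be close.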


Similar resources

Identification with encrypted biometric data

Biometrics make human identification possible with a sample of a biometric trait and an associated database. Classical identification techniques lead to privacy concerns. This paper introduces a new method to identify someone using his biometrics in an encrypted way. Our construction combines Bloom Filters with Storage and Locality-Sensitive Hashing. We apply this error-...


Unified Locality-Sensitive Signatures for Transactional Memory

Transactional Memory (TM) systems must record the memory locations read and written by concurrent transactions in order to detect conflicts. Some TM implementations use signatures for this purpose, which summarize read and write sets in bounded hardware at the cost of false positives due to address aliasing. Signatures are usually implemented as two separate (one for reads and another for write...


XStreamCluster: An Efficient Algorithm for Streaming XML Data Clustering

XML clustering finds many applications, ranging from storage to query processing. However, existing clustering algorithms focus on static XML collections, whereas modern information systems frequently deal with streaming XML data that needs to be processed online. Streaming XML clustering is a challenging task because of the high computational and space efficiency requirements implicated for on...


Distance-Sensitive Bloom Filters

A Bloom filter is a space-efficient data structure that answers set membership queries with some chance of a false positive. We introduce the problem of designing generalizations of Bloom filters designed to answer queries of the form, “Is x close to an element of S?” where closeness is measured under a suitable metric. Such a data structure would have several natural applications in networking...


Evaluation of Scalable PPRL Schemes with a Native LSH Database Engine

In this paper, we present recent work in the newly introduced research area of privacy-preserving record linkage, and then present our L-fold redundant blocking scheme, which relies on the Locality-Sensitive Hashing technique for identifying similar records. These records have undergone an anonymization transformation using a Bloom filter-based encoding technique. ...



Publication date: 2016